In this case study, I analyzed a data set with multiple potential features to determine which features were highly telling in attrition rates. Out of the 840 records given, there were 140 that landed in the numerator for attrition rates. In the first two chunks of code, I loaded int the libraries that would be used for this case study and created several plots to determine contributing factors
Some of the most correlated data points in attrition rates were: Monthly income/Monthly Rate Job Level Overtime Age
Once I was able to figure out several data points that were highly correlated with attrition rates, I ran several predictive models (KNN, Naive Bayes, and Random Forest) to predict if an employee would fall in to the denominator for attrition rates.
This model performed fairly well in accuracy (84%) and Sensitivity (99.2%), but failed to have high specificity rates. This was concerning as there were a lot of false positives in this modle
Similarly to the KNN model, the Naive Bayes model performed well in accuracy (85%), but the sensitivity and specificity were opposite of the KNN model. The Naive Bayes model was able to accurately predict the employees who were not in the numerator of the attrition rate, but was very inaccurate in that it predicted a lot of false positives.
The random forest model was by far the best model that I ran for predicting attrition rates. It’s accuracy (80.8%) was slightly lower than both the KNN and Naive Bayes models, but it’s sensitivity and specificity were drastically higher (61.5% and 84.2% respectfully). This meant that this model was able to more accurately predict true positives and true negatives while not sacrificing accuracy for false positives and false negatives
I then ran two separate models for predicting salaries for employees. The highest correlating fields were Job Level, Total Working Years and Age. This makes sense as the older you are, the more years you would have worked and your job level is usually higher.
When running a random forest model using the fields provided above, I was able to return a RMSE of 1242.69. This was slightly better than the RMSE when using all fields in this model (RMSE of 1278.82)
I then wanted to compare the Random Forest model to a linear regression model, because salary theoretically should be linearly correlated with age, job level and years working. However I found that this models RMSE (1333.81) was slightly higher than the Random Forest model and therefore did not predict salary as accurate.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(ggExtra)
library(e1071)
library(class)
library(ggthemes)
library(naniar)
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following objects are masked from 'package:e1071':
##
## kurtosis, skewness
## The following object is masked from 'package:graphics':
##
## legend
library(corrplot)
## corrplot 0.90 loaded
library(RColorBrewer)
library(funModeling)
## Loading required package: Hmisc
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## funModeling v.1.9.4 :)
## Examples and tutorials at livebook.datascienceheroes.com
## / Now in Spanish: librovivodecienciadedatos.ai
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(stringr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(devtools)
## Loading required package: usethis
library(rpart)
library(caTools)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(lattice)
library(rmarkdown)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(readxl)
library(knitr)
# reading in data
df = read_csv('~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2-data.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Attrition = col_character(),
## BusinessTravel = col_character(),
## Department = col_character(),
## EducationField = col_character(),
## Gender = col_character(),
## JobRole = col_character(),
## MaritalStatus = col_character(),
## Over18 = col_character(),
## OverTime = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
#looking at high level summary stats of columns
head(df)
## # A tibble: 6 x 36
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 32 No Travel_Rarely 117 Sales 13
## 2 2 40 No Travel_Rarely 1308 Research & D… 14
## 3 3 35 No Travel_Frequen… 200 Research & D… 18
## 4 4 32 No Travel_Rarely 801 Sales 1
## 5 5 24 No Travel_Frequen… 567 Research & D… 2
## 6 6 27 No Travel_Frequen… 294 Research & D… 10
## # … with 29 more variables: Education <dbl>, EducationField <chr>,
## # EmployeeCount <dbl>, EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>,
## # Gender <chr>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # Over18 <chr>, OverTime <chr>, PercentSalaryHike <dbl>,
## # PerformanceRating <dbl>, RelationshipSatisfaction <dbl>,
## # StandardHours <dbl>, StockOptionLevel <dbl>, TotalWorkingYears <dbl>,
## # TrainingTimesLastYear <dbl>, WorkLifeBalance <dbl>, YearsAtCompany <dbl>,
## # YearsInCurrentRole <dbl>, YearsSinceLastPromotion <dbl>,
## # YearsWithCurrManager <dbl>
summary(df)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked Over18 OverTime
## Min. : 2094 Min. :0.000 Length:870 Length:870
## 1st Qu.: 8092 1st Qu.:1.000 Class :character Class :character
## Median :14074 Median :2.000 Mode :character Mode :character
## Mean :14326 Mean :2.728
## 3rd Qu.:20456 3rd Qu.:4.000
## Max. :26997 Max. :9.000
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
## Min. :11.0 Min. :3.000 Min. :1.000 Min. :80
## 1st Qu.:12.0 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
## Median :14.0 Median :3.000 Median :3.000 Median :80
## Mean :15.2 Mean :3.152 Mean :2.707 Mean :80
## 3rd Qu.:18.0 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
## Max. :25.0 Max. :4.000 Max. :4.000 Max. :80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :10.00 Median :3.000 Median :3.000
## Mean :0.7839 Mean :11.05 Mean :2.832 Mean :2.782
## 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 1.000
## Mean : 6.962 Mean : 4.205 Mean : 2.169
## 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 3.000
## Max. :40.000 Max. :18.000 Max. :15.000
## YearsWithCurrManager
## Min. : 0.00
## 1st Qu.: 2.00
## Median : 3.00
## Mean : 4.14
## 3rd Qu.: 7.00
## Max. :17.00
str(df)
## spec_tbl_df[,36] [870 × 36] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ID : num [1:870] 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num [1:870] 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : chr [1:870] "No" "No" "No" "No" ...
## $ BusinessTravel : chr [1:870] "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
## $ DailyRate : num [1:870] 117 1308 200 801 567 ...
## $ Department : chr [1:870] "Sales" "Research & Development" "Research & Development" "Sales" ...
## $ DistanceFromHome : num [1:870] 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : num [1:870] 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : chr [1:870] "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
## $ EmployeeCount : num [1:870] 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : num [1:870] 859 1128 1412 2016 1646 ...
## $ EnvironmentSatisfaction : num [1:870] 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : chr [1:870] "Male" "Male" "Male" "Female" ...
## $ HourlyRate : num [1:870] 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : num [1:870] 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : num [1:870] 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : chr [1:870] "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
## $ JobSatisfaction : num [1:870] 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : chr [1:870] "Divorced" "Single" "Single" "Married" ...
## $ MonthlyIncome : num [1:870] 4403 19626 9362 10422 3760 ...
## $ MonthlyRate : num [1:870] 9250 17544 19944 24032 17218 ...
## $ NumCompaniesWorked : num [1:870] 2 1 2 1 1 1 2 2 1 1 ...
## $ Over18 : chr [1:870] "Y" "Y" "Y" "Y" ...
## $ OverTime : chr [1:870] "No" "No" "No" "No" ...
## $ PercentSalaryHike : num [1:870] 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : num [1:870] 3 3 3 3 3 4 3 3 3 3 ...
## $ RelationshipSatisfaction: num [1:870] 3 1 3 3 3 3 1 3 4 2 ...
## $ StandardHours : num [1:870] 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num [1:870] 1 0 0 2 0 2 0 3 1 1 ...
## $ TotalWorkingYears : num [1:870] 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : num [1:870] 3 2 2 3 2 4 5 5 2 3 ...
## $ WorkLifeBalance : num [1:870] 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num [1:870] 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : num [1:870] 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : num [1:870] 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : num [1:870] 3 9 2 7 3 7 3 0 0 7 ...
## - attr(*, "spec")=
## .. cols(
## .. ID = col_double(),
## .. Age = col_double(),
## .. Attrition = col_character(),
## .. BusinessTravel = col_character(),
## .. DailyRate = col_double(),
## .. Department = col_character(),
## .. DistanceFromHome = col_double(),
## .. Education = col_double(),
## .. EducationField = col_character(),
## .. EmployeeCount = col_double(),
## .. EmployeeNumber = col_double(),
## .. EnvironmentSatisfaction = col_double(),
## .. Gender = col_character(),
## .. HourlyRate = col_double(),
## .. JobInvolvement = col_double(),
## .. JobLevel = col_double(),
## .. JobRole = col_character(),
## .. JobSatisfaction = col_double(),
## .. MaritalStatus = col_character(),
## .. MonthlyIncome = col_double(),
## .. MonthlyRate = col_double(),
## .. NumCompaniesWorked = col_double(),
## .. Over18 = col_character(),
## .. OverTime = col_character(),
## .. PercentSalaryHike = col_double(),
## .. PerformanceRating = col_double(),
## .. RelationshipSatisfaction = col_double(),
## .. StandardHours = col_double(),
## .. StockOptionLevel = col_double(),
## .. TotalWorkingYears = col_double(),
## .. TrainingTimesLastYear = col_double(),
## .. WorkLifeBalance = col_double(),
## .. YearsAtCompany = col_double(),
## .. YearsInCurrentRole = col_double(),
## .. YearsSinceLastPromotion = col_double(),
## .. YearsWithCurrManager = col_double()
## .. )
# checking for null values in columns
colSums(is.na(df))
## ID Age Attrition
## 0 0 0
## BusinessTravel DailyRate Department
## 0 0 0
## DistanceFromHome Education EducationField
## 0 0 0
## EmployeeCount EmployeeNumber EnvironmentSatisfaction
## 0 0 0
## Gender HourlyRate JobInvolvement
## 0 0 0
## JobLevel JobRole JobSatisfaction
## 0 0 0
## MaritalStatus MonthlyIncome MonthlyRate
## 0 0 0
## NumCompaniesWorked Over18 OverTime
## 0 0 0
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 0 0 0
## StandardHours StockOptionLevel TotalWorkingYears
## 0 0 0
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 0 0 0
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 0 0 0
gg_miss_var(df,show_pct = TRUE)
# normalizing column
df = df %>% mutate(BusinessTravel = ifelse(BusinessTravel == 'Non_Travel','No','Yes'))
df %>% ggplot()+
geom_boxplot(aes(x=Attrition, y=MonthlyIncome, fill=Attrition), alpha=0.7)+
ggtitle("Attrition by Monthly Income")+
scale_y_continuous(name='Monthly Income')+
theme(plot.title=element_text(size=20, color="Black"))+
scale_fill_manual(values=c("Green","Red"))
df %>% ggplot()+
geom_boxplot(aes(x=Attrition, y=MonthlyRate, fill = Attrition), alpha=0.7)+
ggtitle("Attrition by Monthly Rate")+
scale_y_continuous(name="Monthly Rate")+
scale_fill_manual(values=c("Green","Red"))+
theme(plot.title=element_text(size=20, color="Black"))
# Environment Satisfaction v Attrition
df %>% ggplot()+
geom_boxplot(aes(x=Attrition, y=EnvironmentSatisfaction, fill=Attrition), alpha=0.7)+
ggtitle("Attrition by Envoronment Satisfaction")+
scale_y_continuous(name='Environment Satisfaction')+
theme(plot.title=element_text(size=20, color="Black"))+
scale_fill_manual(values=c("Green","Red"))
df %>% ggplot()+
geom_point(aes(x=Age, y=MonthlyRate,col = Attrition), alpha=0.7)+
ggtitle("Monthly Rate v Age by Attrition")+
scale_y_continuous(name='Monthly Rate')+
theme(plot.title=element_text(size=20, color='Black'))+
scale_color_manual(values=c("Green","Red"))+
facet_wrap(vars(JobLevel))
df %>% ggplot()+
geom_point(aes(x=JobLevel, y=DailyRate, color=Attrition), alpha=0.7)+
ggtitle("Job Level vs Daily Rate")+
scale_y_continuous(name="Years in Current Role")+
theme(plot.title=element_text(size=20, color="Black")) +
scale_color_manual(values=c("Green","Red"))
df %>% ggplot()+
geom_point(aes(x=DailyRate, y=YearsInCurrentRole, color=Attrition), alpha=0.7,position = 'jitter')+
ggtitle("Daily Rate vs Years in Current Role")+
scale_y_continuous(name="Years in Current Role")+
theme(plot.title=element_text(size=20, color="Black")) +
scale_color_manual(values=c("Green","Red"))
df %>% ggplot()+
geom_point(aes(x=NumCompaniesWorked ,y =YearsInCurrentRole ,color = Attrition),alpha = 0.7,position = 'jitter')+
ggtitle('Number of Companies Worked vs Years in Current Role')+
scale_color_manual(values=c("Green","Red"))
df %>% ggplot()+
geom_point(aes(x=YearsInCurrentRole ,y = YearsWithCurrManager ,color = Attrition),alpha = 0.7,position = 'jitter')+
ggtitle('Years in Current Role vs Years with Current Manager')
# Relationship between attrition and overtime
# overtime = df%>%
# group_by(OverTime)%>%
# count(Attrition)%>%
# mutate(AttritionRate=percent(n/sum(n)))
# overtime
#
# df %>% ggplot(aes(x=OverTime, fill=Attrition))+
# geom_bar(alpha=0.7)+
# geom_text(data=overtime, aes(y=n,label=AttritionRate), position=position_stack(vjust=0.5), size=3)+
# ggtitle("Attrition by Overtime")+
# scale_x_discrete(name="Overtime")+
# scale_y_continuous(name="Number of Employees")+
# theme(plot.title=element_text(size=20, color="Black"))+
# scale_fill_manual(values=c("Green", "Red"))
# Relationship between monthly income, work life balance, and attrition
df %>%
mutate(Attrition=ifelse(Attrition=="Yes"," Yes (140)","No (730)"))%>%
ggplot()+
geom_boxplot(aes(x=Attrition, y=MonthlyIncome, fill=Attrition), alpha=0.7)+
ggtitle("Attrition by Monthly Income")+
scale_y_continuous(name='Monthly Income')+
theme(plot.title=element_text(size=20, color="Black"))+
scale_fill_manual(values=c("Green","Red"))
# Determining if Attrition is based off of Income (Monthly Rate and Income)
p = df %>%
mutate(Attrition=ifelse(Attrition=="Yes"," Yes (140)","No (730)"))%>%
ggplot()+
geom_boxplot(aes(x=Attrition, y=MonthlyIncome, fill=Attrition), alpha=0.7)+
ggtitle("Attrition by Monthly Income")+
scale_y_continuous(name='Monthly Income')+
theme(plot.title=element_text(size=15, color="Black"))+
scale_fill_manual(values=c("Green","Red"))
p
p0 = df %>%
mutate(Attrition=ifelse(Attrition=="Yes"," Yes (140)","No (730)"))%>%
ggplot()+
geom_boxplot(aes(x=Attrition, y=MonthlyRate, fill = Attrition), alpha=0.7)+
ggtitle("Attrition by Monthly Rate")+
scale_y_continuous(name="Monthly Rate")+
theme(plot.title=element_text(size=15, color="Black"))+
scale_fill_manual(values=c("Green","Red"))
p0
grid.arrange(p,p0)
#Attrition based on job level
#
# level = df%>%
# group_by(JobLevel)%>%
# count(Attrition)%>%
# mutate(AttritionRate=percent(n/sum(n)))
# level
#
# df %>%
# ggplot( aes(x=JobLevel, fill=Attrition))+
# geom_bar(alpha=0.7)+
# geom_text(data=level, aes(y=n,label=AttritionRate), position=position_stack(vjust=0.5), size=3)+
# ggtitle("Attrition by Job Level")+
# scale_x_discrete(name="Job Level", labels = df$JobLevel)+
# scale_y_continuous(name="# of employees")+
# theme(plot.title=element_text(size=20, color="Black"))+
# scale_fill_manual(values=c("Green", "Red"))
#
#
# p1 = df %>%
# ggplot( aes(x=JobLevel, fill=Attrition))+
# geom_bar(alpha=0.7)+
# geom_text(data=level, aes(y=n,label=AttritionRate), position=position_stack(vjust=0.5), size=3)+
# ggtitle("Attrition by Job Level")+
# scale_x_discrete(name="Job Level", labels = df$JobLevel)+
# scale_y_continuous(name="# of employees")+
# theme(plot.title=element_text(size=20, color="Black"))+
# scale_fill_manual(values=c("Green", "Red"))+
# p1
# grid.arrange(p,p1)
# p2 = df %>%
# ggplot(aes(x=OverTime, fill=Attrition))+
# geom_bar(alpha=0.7)+
# geom_text(data=overtime, aes(y=n,label=AttritionRate), position=position_stack(vjust=0.5), size=3)+
# ggtitle("Attrition by Overtime")+
# scale_x_discrete(name="Overtime")+
# scale_y_continuous(name="Number of Employees")+
# theme(plot.title=element_text(size=20, color="Black"))+
# scale_fill_manual(values=c("Green", "Red"))
# p2
# grid.arrange(p1,p2)
dfKNN= read.csv('~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2-data.csv',header = TRUE)
set.seed(57)
splitPerc = .70
trainAttrKNN= sample(1:dim(dfKNN)[1],round(splitPerc * dim(dfKNN)[1]))
trainKNN = dfKNN[trainAttrKNN,]
testKNN = dfKNN[-trainAttrKNN,]
#creation of KNN model using leave one out method
# classification = knn.cv(dfKNN[,c(2,6,14,16,18,20,22)],dfKNN$Attrition,prob = TRUE, k = 9)
# table(classification,dfKNN$Attrition)
# confusionMatrix(table(classification,dfKNN$Attrition))
# determining best k
set.seed(57)
#running KNN model 90 times to find best k parameter
accs = data.frame(accuracy = numeric(90), k = numeric(90))
for(i in 1:90)
{
classification = knn.cv(dfKNN[,c(2,14,16,18,20,22)],dfKNN$Attrition,prob = TRUE, k = i)
table(classification,dfKNN$Attrition)
CM = confusionMatrix(table(classification,dfKNN$Attrition))
accs$accuracy[i] = CM$overall[1]
accs$k[i] = i
}
plot(accs$k,accs$accuracy, type = "l", xlab = "k")
abline(v=accs$k[which.max(accs$accuracy)], col="red")
accs$k[which.max(accs$accuracy)]
## [1] 19
set.seed(57)
#use tuned parameter from code above
classification = knn.cv(df[,c(2,14,16,18,20,22)],dfKNN$Attrition,prob = TRUE, k = 19)
table(classification,dfKNN$Attrition)
##
## classification No Yes
## No 724 134
## Yes 6 6
confusionMatrix(table(classification,dfKNN$Attrition))
## Confusion Matrix and Statistics
##
##
## classification No Yes
## No 724 134
## Yes 6 6
##
## Accuracy : 0.8391
## 95% CI : (0.8129, 0.8629)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 0.5225
##
## Kappa : 0.0549
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.99178
## Specificity : 0.04286
## Pos Pred Value : 0.84382
## Neg Pred Value : 0.50000
## Prevalence : 0.83908
## Detection Rate : 0.83218
## Detection Prevalence : 0.98621
## Balanced Accuracy : 0.51732
##
## 'Positive' Class : No
##
set.seed(57)
dfNB = read.csv("~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2-data.csv", header = TRUE)
dfNBNO= read.csv("~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2CompSet No Attrition.csv", header = TRUE)
dfNB$Attrition = factor(as.character(dfNB$Attrition), levels=c('Yes', 'No'))
#70/30 Split of data
splitPerc = .70
trainAttrNB= sample(1:dim(dfNB)[1],round(splitPerc * dim(dfNB)[1]))
trainNB = dfNB[trainAttrNB,]
testNB = dfNB[-trainAttrNB,]
model = naiveBayes(trainNB[,c(2,4,5,15,20)],trainNB$Attrition)
table(predict(model,testNB[,c(2,4,5,15,20)]),testNB$Attrition)
##
## Yes No
## Yes 6 2
## No 33 220
confusionMatrix(table(predict(model,testNB[,c(2,4,5,15,20)]),testNB$Attrition))
## Confusion Matrix and Statistics
##
##
## Yes No
## Yes 6 2
## No 33 220
##
## Accuracy : 0.8659
## 95% CI : (0.8185, 0.9048)
## No Information Rate : 0.8506
## P-Value [Acc > NIR] : 0.276
##
## Kappa : 0.2154
##
## Mcnemar's Test P-Value : 3.959e-07
##
## Sensitivity : 0.15385
## Specificity : 0.99099
## Pos Pred Value : 0.75000
## Neg Pred Value : 0.86957
## Prevalence : 0.14943
## Detection Rate : 0.02299
## Detection Prevalence : 0.03065
## Balanced Accuracy : 0.57242
##
## 'Positive' Class : Yes
##
set.seed(57)
dfRF = read.csv("~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2-data.csv", header = TRUE)
dfRFNO= read.csv("~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2CompSet No Attrition.csv", header = TRUE)
#Ensure Attrition is changed to factor
dfRF$Attrition = factor(as.character(dfRF$Attrition), levels=c('Yes', 'No'))
#Split test and train data - 70/30
splitPerc = .70
trainAttrition= sample(1:dim(dfRF)[1],round(splitPerc * dim(dfRF)[1]))
trainRF = dfRF[trainAttrition,]
testRF = dfRF[-trainAttrition,]
#Apply Random Forest using Monthly Income to test data
EmpAttRF = randomForest(Attrition ~ .-MonthlyRate,
data=trainRF,
strata=trainRF$Attrition,
sampsize= c(55,55))
#Use newly trained data set to predict test set
AttPredRF = predict(EmpAttRF,newdata= testRF)
#Create confusion matrix to assess accuracy stats
confusionMatrix(AttPredRF, testRF$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 24 35
## No 15 187
##
## Accuracy : 0.8084
## 95% CI : (0.7554, 0.8543)
## No Information Rate : 0.8506
## P-Value [Acc > NIR] : 0.97406
##
## Kappa : 0.3779
##
## Mcnemar's Test P-Value : 0.00721
##
## Sensitivity : 0.61538
## Specificity : 0.84234
## Pos Pred Value : 0.40678
## Neg Pred Value : 0.92574
## Prevalence : 0.14943
## Detection Rate : 0.09195
## Detection Prevalence : 0.22605
## Balanced Accuracy : 0.72886
##
## 'Positive' Class : Yes
##
#checking importance of variables
varImp(EmpAttRF)
## Overall
## ID 2.6351117
## Age 3.2481944
## BusinessTravel 0.6207013
## DailyRate 2.6712124
## Department 0.6207745
## DistanceFromHome 2.2172413
## Education 1.0534755
## EducationField 0.9551366
## EmployeeCount 0.0000000
## EmployeeNumber 2.5074859
## EnvironmentSatisfaction 1.7172361
## Gender 0.4092124
## HourlyRate 2.4749805
## JobInvolvement 1.5392299
## JobLevel 1.3038299
## JobRole 1.8645592
## JobSatisfaction 1.1553579
## MaritalStatus 1.5691391
## MonthlyIncome 3.5155931
## NumCompaniesWorked 1.7960033
## Over18 0.0000000
## OverTime 2.9664941
## PercentSalaryHike 1.8021369
## PerformanceRating 0.2857546
## RelationshipSatisfaction 1.2148552
## StandardHours 0.0000000
## StockOptionLevel 2.2775792
## TotalWorkingYears 2.6872190
## TrainingTimesLastYear 1.3119910
## WorkLifeBalance 0.9264230
## YearsAtCompany 2.6117705
## YearsInCurrentRole 1.7407151
## YearsSinceLastPromotion 1.4063026
## YearsWithCurrManager 1.8758554
varImpPlot(EmpAttRF)
#Apply Random Forest to the output file
EmpAttRF2 = randomForest(Attrition ~ .-MonthlyRate,
data=dfRF,
strata=dfRF$Attrition,
sampsize= c(55,55))
AttPredRF2 = predict(EmpAttRF2,newdata= dfRFNO)
EmpAttrPredRF = data.frame(dfRFNO$ID, AttPredRF2)
AttPredRF2
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## No No Yes No No No No No No Yes Yes Yes No No Yes No No No No No
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## No No No Yes No No No Yes No Yes No No No No Yes Yes No No No No
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## No Yes No No Yes No No No No No No Yes No No No No Yes No No No
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## No No Yes Yes No Yes No No Yes Yes No No No No No Yes No No No No
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## No No No No No No No No No Yes No No No No No No No No No Yes
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## No No No No Yes No No No Yes No Yes No No No No No No No Yes No
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## No No No No No No Yes No No No No No No Yes No No No No Yes Yes
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## No No No No No No No Yes No No No No No No Yes No No Yes Yes No
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## No No No No No Yes No No No No Yes No No Yes No No No No No No
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## Yes No No No No No No Yes No No No No No No Yes No No No No No
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## No No No No No Yes No No No No No No No No No Yes Yes No No No
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## No Yes No No No No No No Yes No Yes Yes No No Yes No Yes No No No
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## No No No No No No Yes Yes No Yes Yes No No Yes No No No No Yes No
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## No No No No No No No No Yes No No Yes No Yes Yes Yes No No No Yes
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## No No No No Yes Yes No Yes Yes Yes No No No No Yes No No Yes No No
## Levels: Yes No
write.csv(AttPredRF2, "~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/Case2PredictionsHoskinsAttrition.csv")
#Random Forest for Predicting Salary
SalaryRF = randomForest(MonthlyIncome ~ .-MonthlyRate, data=trainRF)
rfPredictSalary = predict(SalaryRF, newdata= testRF)
rfPredictSalary
## 5 9 10 11 13 14 21 23
## 3459.377 2534.835 5764.970 18006.362 10125.457 2264.148 5956.085 5672.620
## 25 26 30 36 39 50 58 63
## 6199.242 18169.875 3152.784 2807.316 18136.712 3053.164 3387.780 2803.398
## 65 68 69 70 71 74 76 79
## 2122.708 3138.465 3133.712 4170.691 3498.056 2851.856 5003.711 11406.036
## 84 85 88 89 91 94 96 100
## 5723.429 6253.823 2991.639 8518.763 2069.538 5827.152 6966.999 2948.442
## 104 105 116 117 118 124 129 139
## 2935.228 16185.111 8814.213 5397.536 5457.455 5709.083 5509.691 3031.284
## 144 150 151 156 160 161 162 174
## 5748.762 5543.774 5986.829 4832.396 11624.838 5322.309 8297.587 9332.028
## 176 178 180 182 192 196 197 199
## 2734.003 3097.022 5662.169 5979.547 9264.912 2494.804 3554.258 11466.202
## 203 208 211 215 218 219 224 232
## 5763.963 6158.477 15090.955 3216.005 14790.109 2893.334 2155.117 2774.496
## 233 235 238 240 242 246 248 253
## 13835.235 14759.111 5639.551 11070.713 17401.089 3155.119 2759.906 2423.879
## 256 257 264 265 267 268 270 271
## 2890.881 2750.577 2664.478 3035.004 12220.304 2392.374 5617.473 5698.419
## 280 294 297 304 305 307 309 310
## 5449.110 11041.740 5296.663 5098.914 2936.572 2824.330 5497.761 5055.945
## 311 316 317 319 325 335 340 343
## 5843.176 5627.656 3406.174 5915.705 15382.402 17321.663 2420.453 9049.609
## 344 345 353 354 357 359 363 367
## 2098.683 5897.098 5384.822 5761.515 3088.118 5082.421 17773.419 4598.717
## 373 377 382 394 397 404 406 408
## 11402.384 4455.650 3478.120 11167.656 15528.321 9353.234 15262.316 3099.338
## 415 416 417 419 422 423 434 436
## 2814.082 5430.155 3378.030 5519.022 5300.364 5839.348 2095.193 5400.852
## 439 440 443 445 447 454 456 457
## 3117.054 5231.434 2978.855 17834.848 5491.721 5085.666 17845.912 5940.055
## 460 461 463 471 472 475 478 480
## 4952.962 3093.328 10463.792 3200.166 2114.316 2922.684 5755.807 10532.424
## 483 487 490 491 496 498 502 503
## 5189.707 2294.234 5673.615 16933.023 14744.177 2968.331 3186.890 4650.169
## 509 511 515 517 518 523 524 529
## 3535.366 5994.652 2746.383 2180.408 3152.590 11880.651 17535.001 3490.983
## 532 533 541 542 546 548 549 550
## 5937.760 3403.356 5195.989 3502.790 3370.035 5181.644 10976.907 3150.503
## 554 558 559 566 569 573 574 575
## 3590.147 3045.383 5341.219 3327.650 5532.320 5617.879 5465.769 6083.546
## 578 580 582 584 585 588 593 594
## 2785.844 5805.294 13370.645 5780.451 5529.600 3760.821 3244.649 3466.373
## 597 598 604 607 612 617 621 623
## 10164.309 5622.336 3416.027 6008.514 5378.577 3021.166 2032.931 3171.736
## 627 632 633 634 638 641 644 650
## 4847.439 9891.293 6306.934 3073.819 8835.707 2088.890 2896.544 3298.699
## 663 664 665 677 678 679 685 686
## 9739.103 2766.161 5657.371 5201.885 5341.447 5748.168 3065.647 9883.816
## 691 692 694 697 699 700 701 706
## 5441.837 9333.649 2780.868 15656.795 2155.440 17869.460 2227.875 5603.159
## 711 714 718 723 727 729 731 734
## 9576.638 10597.453 5180.199 3185.659 3575.795 2108.447 4994.044 5967.689
## 739 740 742 743 750 756 757 760
## 5111.262 5064.696 9231.377 5129.262 2994.920 3098.431 5703.018 17700.392
## 763 765 767 771 775 777 779 783
## 6290.301 3011.546 3269.227 5649.072 11254.254 3030.623 3381.033 3124.199
## 785 787 788 789 793 798 808 809
## 4924.975 10428.742 5596.574 17364.442 9547.543 2313.726 3037.742 3250.919
## 813 814 817 819 820 821 826 832
## 4847.260 17375.355 5982.544 3310.450 6017.337 3231.033 17358.987 2151.670
## 834 837 841 846 847 850 854 857
## 2166.949 3286.376 2022.629 4743.451 2230.513 3541.493 5622.431 3057.628
## 860 862 865 867 868
## 9025.307 11145.113 9739.590 5080.244 5567.957
RMSE(testRF$MonthlyIncome, rfPredictSalary)
## [1] 1278.822
plot(testRF$MonthlyIncome, rfPredictSalary, type = "p", main= "Random Forest RMSE for Predicting Salary", xlab= "Monthly Income", ylab= "Monthly Income Prediction")
varImp(SalaryRF)
## Overall
## ID 108353760
## Age 631345325
## Attrition 15554233
## BusinessTravel 23390729
## DailyRate 97373406
## Department 68784568
## DistanceFromHome 87386322
## Education 44281962
## EducationField 31019256
## EmployeeCount 0
## EmployeeNumber 79525766
## EnvironmentSatisfaction 39786380
## Gender 14684276
## HourlyRate 95135160
## JobInvolvement 20975466
## JobLevel 6043491747
## JobRole 385280923
## JobSatisfaction 27032556
## MaritalStatus 18877817
## NumCompaniesWorked 98007148
## Over18 0
## OverTime 10514711
## PercentSalaryHike 58571208
## PerformanceRating 4273737
## RelationshipSatisfaction 26087941
## StandardHours 0
## StockOptionLevel 20013134
## TotalWorkingYears 3007766939
## TrainingTimesLastYear 35859135
## WorkLifeBalance 23289964
## YearsAtCompany 673514834
## YearsInCurrentRole 217619204
## YearsSinceLastPromotion 138133424
## YearsWithCurrManager 118881181
varImpPlot(SalaryRF)
#random forest model 2: Updating inputs
SalaryRF2 = randomForest(MonthlyIncome ~ JobLevel+TotalWorkingYears+Age, data=trainRF)
rfPredictSalary2 = predict(SalaryRF2, newdata= testRF)
rfPredictSalary2
## 5 9 10 11 13 14 21 23
## 3008.463 2475.905 5797.711 18450.918 10199.909 2298.563 5567.769 5744.283
## 25 26 30 36 39 50 58 63
## 5959.013 18844.024 3038.743 2960.064 17794.401 2964.229 3182.236 2678.947
## 65 68 69 70 71 74 76 79
## 2273.065 2853.009 2984.358 3884.374 3377.593 2853.345 5197.642 11054.829
## 84 85 88 89 91 94 96 100
## 5838.737 5986.942 2984.358 8354.179 2647.326 5491.053 5366.827 3068.278
## 104 105 116 117 118 124 129 139
## 2853.009 15714.899 9157.574 4991.568 5314.910 5653.676 5251.541 3217.573
## 144 150 151 156 160 161 162 174
## 5813.691 5865.577 5759.602 4848.980 10891.505 5720.599 8303.594 9649.549
## 176 178 180 182 192 196 197 199
## 2535.340 3075.014 5770.316 5748.891 9539.973 2731.734 3319.244 12050.701
## 203 208 211 215 218 219 224 232
## 5651.942 5838.648 15556.747 2992.408 15311.929 2930.211 2441.334 2690.296
## 233 235 238 240 242 246 248 253
## 15577.559 16442.024 5867.964 10374.858 18321.834 2961.371 2701.717 2681.594
## 256 257 264 265 267 268 270 271
## 2828.556 2690.104 2687.411 3008.463 12452.683 2434.842 5387.653 5527.884
## 280 294 297 304 305 307 309 310
## 4702.741 9797.193 5473.149 5491.053 3021.124 3100.679 5186.338 4922.377
## 311 316 317 319 325 335 340 343
## 5504.799 5282.133 3123.006 5524.242 14092.859 16669.902 2412.614 9718.183
## 344 345 353 354 357 359 363 367
## 2298.563 5469.644 5647.227 5504.631 3093.587 5429.614 18174.279 5434.657
## 373 377 382 394 397 404 406 408
## 11279.717 3602.378 3141.360 10638.489 16179.104 9103.711 15850.122 2945.371
## 415 416 417 419 422 423 434 436
## 2715.040 5817.733 3173.188 5881.461 5946.085 5574.238 2342.823 5320.210
## 439 440 443 445 447 454 456 457
## 2988.379 5711.964 3093.627 18607.959 5801.927 5458.289 18344.197 5774.538
## 460 461 463 471 472 475 478 480
## 5524.908 3011.717 10910.614 2964.229 2342.823 3040.921 5364.661 11108.015
## 483 487 490 491 496 498 502 503
## 5381.981 2298.563 5473.149 18118.579 15614.721 3348.734 3143.849 3514.469
## 509 511 515 517 518 523 524 529
## 3376.636 5711.964 2597.097 2315.613 2896.788 11377.025 17661.102 2983.840
## 532 533 541 542 546 548 549 550
## 5647.890 3182.236 5720.599 2968.615 3041.749 5687.082 10257.554 2953.780
## 554 558 559 566 569 573 574 575
## 3555.923 3123.006 5692.050 2831.185 5692.050 5539.571 5473.149 5686.839
## 578 580 582 584 585 588 593 594
## 2851.986 5507.027 13108.061 5414.099 5220.919 3055.392 3093.627 3252.197
## 597 598 604 607 612 617 621 623
## 10232.259 5353.142 3182.236 5082.515 5669.613 2995.431 2286.457 3041.749
## 627 632 633 634 638 641 644 650
## 4733.090 10023.639 5801.927 2945.371 9038.502 2298.563 3072.309 3514.229
## 663 664 665 677 678 679 685 686
## 9906.712 2688.288 5473.149 5524.908 5687.082 5986.942 3093.627 10100.311
## 691 692 694 697 699 700 701 706
## 5711.964 9133.996 2643.380 15121.010 2043.532 18486.786 2342.823 5509.603
## 711 714 718 723 727 729 731 734
## 9255.106 10150.552 5244.536 3041.749 3471.478 2362.785 4853.189 5867.964
## 739 740 742 743 750 756 757 760
## 5024.825 5527.681 8939.384 5819.230 3032.020 3062.688 5491.053 18502.902
## 763 765 767 771 775 777 779 783
## 5415.609 3134.003 3389.893 5430.533 12200.578 3013.502 3124.524 3012.388
## 785 787 788 789 793 798 808 809
## 4831.324 10759.015 5623.726 18617.707 9003.591 2344.818 2558.658 3245.614
## 813 814 817 819 820 821 826 832
## 4733.090 18598.363 5197.642 3013.502 5865.577 3103.398 18556.835 2117.902
## 834 837 841 846 847 850 854 857
## 2178.375 3175.668 2043.532 5244.536 2178.375 3141.360 5288.429 2986.634
## 860 862 865 867 868
## 9240.192 11772.029 9324.290 4938.954 5729.469
varImp(SalaryRF2)
## Overall
## JobLevel 5996246926
## TotalWorkingYears 4335279959
## Age 1305959303
varImpPlot(SalaryRF2)
#RMSE of Random Forest model for predicting salary
RMSE(testRF$MonthlyIncome, rfPredictSalary2)
## [1] 1242.685
plot(testRF$MonthlyIncome, rfPredictSalary2, type = "p", main= "Random Forest RMSE for Predicting Salary", xlab= "Monthly Income", ylab= "Monthly Income Prediction")
#Linear regression model
SalaryRM =lm(MonthlyIncome ~ JobLevel+TotalWorkingYears+Age, data=trainRF)
lmPredictSalary= predict(SalaryRM, newdata= testRF)
lmPredictSalary
## 5 9 10 11 13 14 21 23
## 2278.947 2000.145 6067.186 17820.044 10247.304 1998.899 6349.725 6182.756
## 25 26 30 36 39 50 58 63
## 5954.107 18102.272 2231.286 2225.055 18669.843 2279.881 2506.038 2111.977
## 65 68 69 70 71 74 76 79
## 1997.030 2222.252 2225.990 2903.213 2622.542 2284.243 6291.473 10588.719
## 84 85 88 89 91 94 96 100
## 6350.037 6631.642 2225.990 9793.745 2001.391 6180.264 6746.278 2338.445
## 104 105 116 117 118 124 129 139
## 2222.252 14940.168 9853.244 5895.232 6178.395 6068.432 6746.589 2396.386
## 144 150 151 156 160 161 162 174
## 6011.737 5952.550 6122.634 5840.406 10418.635 6178.706 9850.440 9851.998
## 176 178 180 182 192 196 197 199
## 2052.479 2279.570 6293.965 6237.582 9794.991 2005.441 2512.891 10644.791
## 203 208 211 215 218 219 224 232
## 6404.862 6575.882 14147.063 2223.498 14318.394 2224.432 1998.276 2056.840
## 233 235 238 240 242 246 248 253
## 14374.466 14373.220 5953.796 10364.744 17877.050 2393.271 2112.289 2003.572
## 256 257 264 265 267 268 270 271
## 2341.249 2055.594 2111.354 2278.947 13862.966 1999.522 6009.556 6183.379
## 280 294 297 304 305 307 309 310
## 5848.194 10081.892 6179.952 6180.264 2345.922 2399.813 6581.177 5838.848
## 311 316 317 319 325 335 340 343
## 6123.569 6127.619 2282.062 6068.743 14146.440 18332.478 1997.965 9795.614
## 344 345 353 354 357 359 363 367
## 1998.899 6065.939 6520.121 6069.055 2223.186 6122.946 18499.447 6461.869
## 373 377 382 394 397 404 406 408
## 10645.103 2959.285 2392.336 10537.009 14430.538 9796.549 14147.687 2339.068
## 415 416 417 419 422 423 434 436
## 2112.912 6462.180 2619.116 6406.420 6462.803 6291.784 1998.588 6065.005
## 439 440 443 445 447 454 456 457
## 2117.896 6181.198 2223.809 17990.128 6688.648 6179.641 17819.109 6179.018
## 460 461 463 471 472 475 478 480
## 6180.575 2397.321 10195.594 2279.881 1998.588 2167.426 6064.693 10191.544
## 483 487 490 491 496 498 502 503
## 6518.252 1998.899 6179.952 17878.296 14601.245 2170.230 2280.193 3019.406
## 509 511 515 517 518 523 524 529
## 2677.680 6181.198 2108.862 1997.341 2340.937 10701.486 18385.746 2282.373
## 532 533 541 542 546 548 549 550
## 6462.492 2506.038 6178.706 2732.817 2507.595 6181.821 10361.005 2285.489
## 554 558 559 566 569 573 574 575
## 2681.106 2282.062 6182.444 2285.177 6182.444 6008.933 6179.952 5951.927
## 578 580 582 584 585 588 593 594
## 2341.560 5952.861 13865.458 6065.316 6126.684 2508.218 2223.809 2512.268
## 597 598 604 607 612 617 621 623
## 10190.921 6745.655 2506.038 6751.574 6182.133 2394.517 1995.784 2507.595
## 627 632 633 634 638 641 644 650
## 5894.920 10077.843 6688.648 2339.068 9795.926 1998.899 2225.679 2509.464
## 663 664 665 677 678 679 685 686
## 10025.197 2113.223 6179.952 6180.575 6181.821 6631.642 2223.809 10020.525
## 691 692 694 697 699 700 701 706
## 6181.198 9856.359 2111.666 14262.011 1938.777 18046.511 1998.588 6520.433
## 711 714 718 723 727 729 731 734
## 9794.680 10191.232 6178.083 2507.595 2508.530 1997.653 5844.456 5953.796
## 739 740 742 743 750 756 757 760
## 5898.347 6123.880 9851.375 6180.887 2168.984 2454.327 6180.264 18102.583
## 763 765 767 771 775 777 779 783
## 6689.583 2169.295 2509.776 6123.257 10644.168 2280.816 2335.953 2338.134
## 785 787 788 789 793 798 808 809
## 5845.702 10139.522 6241.943 18556.765 9853.555 1996.407 2052.790 2506.349
## 813 814 817 819 820 821 826 832
## 5894.920 18159.590 6291.473 2280.816 5952.550 2281.439 17933.745 1939.089
## 834 837 841 846 847 850 854 857
## 1995.472 2395.763 1938.777 6178.083 1995.472 2392.336 6121.700 2393.582
## 860 862 865 867 868
## 9855.424 10588.096 9855.736 5841.029 6240.385
RMSE(testRF$MonthlyIncome, lmPredictSalary)
## [1] 1333.81
plot(testRF$MonthlyIncome, lmPredictSalary, type = "p", main= "Linear Regression Model for Predicting Salary", xlab= "Monthly Income", ylab= "Monthly Income Prediction")
abline(SalaryRM)
## Warning in abline(SalaryRM): only using the first two of 4 regression
## coefficients
summary(lmPredictSalary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1939 2284 6065 6145 6690 18670
#Load in Data set with no salary information
NOSalary = read_xlsx('~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/CaseStudy2CompSet No Salary.xlsx')
#Apply Random Forest to the output file
SalPredRFFinal = predict(SalaryRF2,
newdata= NOSalary)
Case2PredictionsHoskinsSalary = data.frame(NOSalary$ID, SalPredRFFinal)
Case2PredictionsHoskinsSalary
## NOSalary.ID SalPredRFFinal
## 1 871 5304.535
## 2 872 2962.120
## 3 873 15934.312
## 4 874 3017.748
## 5 875 2951.358
## 6 876 5583.527
## 7 877 5746.836
## 8 878 2535.340
## 9 879 3245.614
## 10 880 15614.721
## 11 881 10500.301
## 12 882 3062.687
## 13 883 5507.027
## 14 884 4871.505
## 15 885 5314.910
## 16 886 4297.845
## 17 887 5081.805
## 18 888 5763.728
## 19 889 5774.538
## 20 890 2983.840
## 21 891 5558.351
## 22 892 9255.106
## 23 893 8838.062
## 24 894 5683.140
## 25 895 9682.719
## 26 896 9333.736
## 27 897 9310.877
## 28 898 15970.982
## 29 899 5473.149
## 30 900 3029.811
## 31 901 2964.229
## 32 902 5427.416
## 33 903 5441.961
## 34 904 3132.404
## 35 905 16161.910
## 36 906 4965.452
## 37 907 10796.581
## 38 908 5824.877
## 39 909 2782.612
## 40 910 2434.842
## 41 911 18770.703
## 42 912 3029.123
## 43 913 3062.687
## 44 914 10023.639
## 45 915 5507.117
## 46 916 4363.694
## 47 917 2740.051
## 48 918 4886.987
## 49 919 2441.334
## 50 920 3460.665
## 51 921 2956.346
## 52 922 2394.111
## 53 923 5549.060
## 54 924 5692.050
## 55 925 16141.959
## 56 926 3514.229
## 57 927 5558.044
## 58 928 10072.582
## 59 929 2344.818
## 60 930 3104.621
## 61 931 5434.657
## 62 932 9074.923
## 63 933 9175.125
## 64 934 2856.645
## 65 935 2286.457
## 66 936 2973.725
## 67 937 10635.370
## 68 938 10602.826
## 69 939 5824.482
## 70 940 2441.334
## 71 941 3414.173
## 72 942 9281.290
## 73 943 10838.222
## 74 944 4813.908
## 75 945 2043.532
## 76 946 5507.027
## 77 947 2344.922
## 78 948 2662.778
## 79 949 3141.360
## 80 950 5209.809
## 81 951 3343.341
## 82 952 5504.631
## 83 953 5824.877
## 84 954 2964.229
## 85 955 5547.831
## 86 956 9147.207
## 87 957 11337.984
## 88 958 3240.629
## 89 959 11769.876
## 90 960 16851.926
## 91 961 2966.576
## 92 962 5558.351
## 93 963 5742.041
## 94 964 5627.748
## 95 965 2615.756
## 96 966 18619.274
## 97 967 2982.364
## 98 968 9006.531
## 99 969 14538.395
## 100 970 3093.587
## 101 971 5387.106
## 102 972 5774.538
## 103 973 3093.627
## 104 974 2286.457
## 105 975 4733.090
## 106 976 3010.270
## 107 977 5382.184
## 108 978 3226.564
## 109 979 10199.909
## 110 980 18174.516
## 111 981 2983.840
## 112 982 10542.947
## 113 983 3071.907
## 114 984 3093.587
## 115 985 5668.982
## 116 986 3104.621
## 117 987 2344.818
## 118 988 5651.065
## 119 989 11511.704
## 120 990 2678.947
## 121 991 5658.706
## 122 992 3026.345
## 123 993 5197.642
## 124 994 5009.708
## 125 995 5897.564
## 126 996 5381.981
## 127 997 2901.140
## 128 998 15469.578
## 129 999 8917.353
## 130 1000 2434.842
## 131 1001 5740.831
## 132 1002 4878.340
## 133 1003 5558.044
## 134 1004 5266.844
## 135 1005 3312.803
## 136 1006 5504.799
## 137 1007 2480.988
## 138 1008 5711.964
## 139 1009 18637.982
## 140 1010 5756.857
## 141 1011 10863.502
## 142 1012 5512.128
## 143 1013 5588.438
## 144 1014 5518.997
## 145 1015 5615.101
## 146 1016 2848.938
## 147 1017 3343.341
## 148 1018 8970.046
## 149 1019 8493.887
## 150 1020 2043.532
## 151 1021 5293.760
## 152 1022 5580.592
## 153 1023 5456.013
## 154 1024 11377.025
## 155 1025 15556.747
## 156 1026 16527.611
## 157 1027 5759.602
## 158 1028 4991.568
## 159 1029 4718.366
## 160 1030 12050.701
## 161 1031 4938.954
## 162 1032 3269.831
## 163 1033 8354.179
## 164 1034 2362.785
## 165 1035 5719.457
## 166 1036 2286.457
## 167 1037 5810.916
## 168 1038 11134.978
## 169 1039 9295.134
## 170 1040 5081.805
## 171 1041 3063.559
## 172 1042 3135.998
## 173 1043 9006.531
## 174 1044 5986.942
## 175 1045 3093.627
## 176 1046 2441.334
## 177 1047 5620.108
## 178 1048 9402.074
## 179 1049 3008.992
## 180 1050 5626.213
## 181 1051 10790.389
## 182 1052 2986.563
## 183 1053 5323.011
## 184 1054 3143.849
## 185 1055 2964.229
## 186 1056 3166.564
## 187 1057 5683.140
## 188 1058 15994.881
## 189 1059 6010.819
## 190 1060 3143.849
## 191 1061 5692.283
## 192 1062 5504.799
## 193 1063 5402.700
## 194 1064 2639.917
## 195 1065 5634.594
## 196 1066 2853.009
## 197 1067 4030.427
## 198 1068 17128.762
## 199 1069 4925.990
## 200 1070 3110.748
## 201 1071 2992.408
## 202 1072 5730.864
## 203 1073 2480.988
## 204 1074 5720.599
## 205 1075 4987.877
## 206 1076 15669.934
## 207 1077 18637.982
## 208 1078 5774.538
## 209 1079 5524.908
## 210 1080 5514.817
## 211 1081 3132.404
## 212 1082 3098.854
## 213 1083 2267.482
## 214 1084 5817.733
## 215 1085 9003.591
## 216 1086 10944.394
## 217 1087 5757.456
## 218 1088 3152.949
## 219 1089 3063.559
## 220 1090 5683.140
## 221 1091 5250.588
## 222 1092 5647.890
## 223 1093 9157.574
## 224 1094 5414.213
## 225 1095 5810.916
## 226 1096 5824.877
## 227 1097 3884.374
## 228 1098 3013.502
## 229 1099 3104.621
## 230 1100 5819.230
## 231 1101 18344.197
## 232 1102 5504.631
## 233 1103 5742.041
## 234 1104 11002.628
## 235 1105 2986.563
## 236 1106 2961.371
## 237 1107 3416.574
## 238 1108 5353.276
## 239 1109 5387.106
## 240 1110 5788.221
## 241 1111 5817.733
## 242 1112 9597.015
## 243 1113 10792.500
## 244 1114 5472.285
## 245 1115 10401.871
## 246 1116 3008.992
## 247 1117 6388.237
## 248 1118 5244.536
## 249 1119 11082.240
## 250 1120 3042.829
## 251 1121 10436.659
## 252 1122 15950.877
## 253 1123 5774.538
## 254 1124 15220.704
## 255 1125 2734.427
## 256 1126 2178.375
## 257 1127 3008.463
## 258 1128 5323.253
## 259 1129 15545.972
## 260 1130 2829.811
## 261 1131 9539.973
## 262 1132 14092.859
## 263 1133 5692.050
## 264 1134 3209.141
## 265 1135 3240.629
## 266 1136 3075.014
## 267 1137 3884.374
## 268 1138 5669.613
## 269 1139 10681.743
## 270 1140 12464.814
## 271 1141 5774.538
## 272 1142 5692.050
## 273 1143 10397.620
## 274 1144 4859.967
## 275 1145 5339.385
## 276 1146 5250.925
## 277 1147 5885.300
## 278 1148 2958.344
## 279 1149 2687.411
## 280 1150 5559.833
## 281 1151 3209.141
## 282 1152 5320.210
## 283 1153 5278.766
## 284 1154 2441.334
## 285 1155 5518.762
## 286 1156 15473.817
## 287 1157 2982.364
## 288 1158 5647.890
## 289 1159 2954.485
## 290 1160 5580.592
## 291 1161 2831.185
## 292 1162 9003.591
## 293 1163 2986.155
## 294 1164 9270.753
## 295 1165 3141.360
## 296 1166 2718.669
## 297 1167 10219.464
## 298 1168 5605.982
## 299 1169 2964.229
## 300 1170 3175.668
write.csv(Case2PredictionsHoskinsSalary, "~/Desktop/MSDS/Doing Data Science/MSDS_6306_Doing-Data-Science-Master/Unit 14 and 15 Case Study 2/Case2PredictionsHoskinsSalary.csv")